Adaptive learning system and method

ABSTRACT

The invention provides a neural network module comprising an input layer comprising one or more input nodes arranged to receive input data, a rule base layer comprising one or more rule nodes, an output layer comprising one or more output nodes, and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data. The invention also provides an adaptive learning system comprising one or more of the neural network modules of the invention. The invention further provides related methods of implementing a neural network module and an adaptive learning system, and a neural network computer program.

FIELD OF INVENTION

[0001] The invention relates to an adaptive learning system and method, and in particular relates to a neural network module forming part of an adaptive learning system.

BACKGROUND TO INVENTION

[0002] Real world problems such as massive biological data analysis and knowledge discovery, adaptive speech recognition and life-long language acquisition, adaptive intelligent prediction and control systems, intelligent agent-based systems and adaptive agents on the Web, mobile robots, visual monitoring systems, multi-modal information processing, intelligent adaptive decision support systems, adaptive domestic appliances and intelligent buildings, systems that learn and control brain and body states from biofeedback, systems which classify bio-informatic data, and other systems require sophisticated solutions for building on-line adaptive knowledge base systems.

[0003] Such systems should be able to learn quickly from a large amount of data, adapt incrementally in an on-line mode, have an open structure so as to allow dynamic creation of new modules, memorise information that can be used later, interact continuously with the environment in a “life-long” learning mode, deal with knowledge as well as with data, and adequately represent space and time in their structure.

[0004] Well established neural network and artificial intelligence (AI) techniques have difficulties when applied to on-line knowledge based learning. For example, multi-layer perceptrons (MLP) and backpropagation learning algorithms have a number of problems, including catastrophic forgetting, the local minima problem, difficulties in extracting rules, inability to adapt to new data without retraining on old data, and excessive training times when applied to large data sets.

[0005] The self-organising map (SOM) may not be efficient when applied to unsupervised adaptive learning on new data, as the SOM assumes a fixed structure and a fixed grid of nodes connected in a topological output space that may not be appropriate to project a particular data set. Radial basis neural networks require clustering to be performed first and then the backpropagation algorithm applied. Neuro-fuzzy systems cannot update the learned rules through continuous training on additional data without catastrophic forgetting.

[0006] These types of network are not efficient for adaptive, on-line learning, although they do provide an improvement over prior techniques.

SUMMARY OF INVENTION

[0007] In one form the invention comprises a neural network module comprising an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes; an output layer comprising one or more output nodes; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.

[0008] In another form the invention comprises a method of implementing a neural network module comprising the steps of arranging an input layer comprising one or more input nodes to receive input data; arranging a rule base layer comprising one or more rule nodes; arranging an output layer comprising one or more output nodes; and arranging an adaptive component to aggregate selected two or more rule nodes in the rule base layer based on the input data.

[0009] In a further form the invention comprises a neural network computer program comprising an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes; an output layer comprising one or more output nodes; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.

BRIEF DESCRIPTION OF THE FIGURES

[0010] Preferred forms of the adaptive learning system and method will now be described with reference to the accompanying figures in which:

[0011] FIG. 1 is a schematic view of hardware on which one form of the invention may be implemented;

[0012] FIG. 2 is a further schematic view of an adaptive learning system of the invention;

[0013] FIG. 3 is a schematic view of a neural network module of FIG. 2;

[0014] FIG. 4 is an example of membership functions for use with the invention;

[0015] FIG. 5 is an example of a rule node of the invention;

[0016] FIG. 6 illustrates the adjustment and learning process relating to the rule node of FIG. 5;

[0017] FIG. 7 shows an adaptive learning system of the invention having three rule nodes;

[0018] FIG. 8 shows one method of aggregating the rule nodes of FIG. 7;

[0019] FIG. 9 illustrates another method of aggregating the three rule nodes of FIG. 7;

[0020] FIGS. 10 and 11 illustrate the aggregation of spatially allocated rule nodes;

[0021] FIGS. 12 and 13 illustrate the aggregation of linearly allocated rule nodes;

[0022] FIGS. 14 to 17 illustrate different allocation strategies for new rule nodes;

[0023] FIGS. 18A and 18B illustrate the system learning a complex time series chaotic function;

[0024] FIG. 19 is a table of selected rules extracted from a system trained on the function of FIG. 18;

[0025] FIGS. 20 and 21 illustrate the system learning from time series data examples;

[0026] FIGS. 22 and 23 illustrate unsupervised continuous learning by the system;

[0027] FIG. 24 illustrates evolved rule nodes and the trajectory of a spoken word ‘zoo’ in the two dimensional space of the first two principal components in a system trained with a mix of spoken words in NZ English and Maori;

[0028] FIG. 25 illustrates comparative analysis of the learning model of the system with other models;

[0029] FIG. 26 is a table showing global test accuracy of a known method compared with the system of the invention;

[0030] FIG. 27 illustrates a rule from a set of rules extracted from an evolved system from a sequence of biological data for the identification of a splice junction between introns and exons in a gene; and

[0031] FIG. 28 illustrates a rule from a set of rules extracted from an evolved system from micro-array gene expression data taken from two types of leukaemia cancer tissue, ALL and AML.

DETAILED DESCRIPTION OF PREFERRED FORMS

[0032] FIG. 1 illustrates preferred form hardware on which one form of the invention may be implemented. The preferred system 2 comprises a data processor 4 interfaced to a main memory 6, the processor 4 and the memory 6 operating under the control of appropriate operating and application software or hardware. The processor 4 could be interfaced to one or more input devices 8 and one or more output devices 10 with an I/O controller 12. The system 2 may further include suitable mass storage devices 14, for example floppy, hard disk or CD-ROM drives or DVD apparatus, a screen display 16, a pointing device 17, a modem 18 and/or network controller 19. The various components could be connected via a system bus or over a wired or wireless network.

[0033] In one form the invention could be arranged for use in speech recognition and to be trained on model speech signals. In this form, the input device(s) 8 could comprise a microphone and/or a further storage device in which audio signals or representations of audio signals are stored. The output device(s) 10 could comprise a printer for displaying the speech or language processed by the system, and/or a suitable speaker for generating sound. Speech or language could also be displayed on display device 16.

[0034] Where the invention is arranged to classify bio-informatics case study data, this data could be stored in a mass storage device 14, accessed by the processor 4 and the results displayed on a screen display 16 and/or a further output device 10.

[0035] Where the system 2 is arranged for use with a mobile robot, the input device(s) 8 could include sensors or other apparatus arranged to form representations of an environment. The input devices could also include secondary storage in which a representation of an environment is stored. The output device(s) 10 could include a monitor or visual display unit to display the environment processed by the system. The processor 4 could also be interfaced to motor control means to transport the robot from one location in the processed environment to another location.

[0036] It will be appreciated that the adaptive learning system 2 could be arranged to operate in many different environments and to solve many different problems. In each case, the system 2 evolves its structure and functionality over time through interaction with the environment through the input devices 8 and the output devices 10.

[0037] FIG. 2 illustrates the computer-implemented aspects of the invention stored in memory 6 and/or mass storage 14 and arranged to operate with processor 4. The preferred system is arranged as an evolving connectionist system 20. The system 20 is provided with one or more neural network modules or NNM 22. The arrangement and operation of the neural network module(s) 22 forms the basis of the invention and will be further described below.

[0038] The system includes a representation or memory component 26 comprising one or more neural network modules 22. The representation component 26 preferably includes an adaptation component 28, as will be particularly described below, which enables rule nodes to be inserted, extracted and/or aggregated.

[0039] The system 20 may include a number of further known components, for example a feature selection component 24 arranged to perform filtering of the input information, feature extraction and forming of the input vectors.

[0040] The system may also include a higher level decision component 30 comprising one or more modules which receive feedback from the environment 34, an action component 32 comprising one or more modules which take output values from the decision component and pass output information to the environment 34, and a knowledge base 36 which is arranged to extract compressed abstract information from the representation component 26 and the decision component 30 in the form of rules, abstract associations and other information. The knowledge base 36 may use techniques such as genetic algorithms or other evolutionary computation techniques to evaluate and optimise the parameters of the system during its operation.

[0041] FIG. 3 illustrates one preferred form of neural network module 22. The preferred structure is a fuzzy neural network, which is a connectionist structure that implements fuzzy rules. The neural network module 22 includes input layer 40 having one or more input nodes 42 arranged to receive input data.

[0042] The neural network module 22 may further comprise fuzzy input layer 44 having one or more fuzzy input nodes 46. The fuzzy input nodes 46 transform data from the input nodes 42 for the further use of the system. Each of the fuzzy input nodes 46 could have a different membership function attached to it. One example of a membership function is the triangular membership function shown in FIG. 4. The membership functions could also include Gaussian functions or any other known functions suitable for the purpose. The system is preferably arranged so that the number and type of the membership functions may be dynamically modified, as will be described further below. The main purpose of the fuzzy input nodes 46 is to transform the input values from the input nodes 42 into the membership degrees to which the values belong to the membership functions.
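As an illustration of the fuzzification step performed by the fuzzy input nodes, the following Python sketch converts one crisp input value into membership degrees using overlapping triangular membership functions. The function name, the three labels and the centre positions are assumptions chosen for the example; the description does not fix the number or placement of the membership functions.

```python
def triangular_memberships(value, centres):
    """Return the membership degree of `value` in each triangular function.

    `centres` is an increasing list of peak positions (an assumed layout):
    each triangle peaks at its own centre and falls to zero at the
    neighbouring centres. Values outside the outermost centres are
    clamped to them.
    """
    value = min(max(value, centres[0]), centres[-1])
    degrees = []
    for i, c in enumerate(centres):
        left = centres[i - 1] if i > 0 else c
        right = centres[i + 1] if i < len(centres) - 1 else c
        if value == c:
            degree = 1.0
        elif left < value < c:
            degree = (value - left) / (c - left)    # rising edge
        elif c < value < right:
            degree = (right - value) / (right - c)  # falling edge
        else:
            degree = 0.0
        degrees.append(degree)
    return degrees


# Three hypothetical labels (Small, Medium, Large) on the range [0, 10].
CENTRES = [0.0, 5.0, 10.0]
small, medium, large = triangular_memberships(2.5, CENTRES)
```

With these assumed centres, the value 2.5 belongs to Small and to Medium with degree 0.5 each and to Large with degree 0.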

[0043] The neural network module 22 further comprises rule base layer 48 having one or more rule nodes 50. Each rule node 50 is defined by two vectors of connection weights W1(r) and W2(r). Connection weight W1(r) is preferably adjusted through unsupervised learning based on a similarity measure within a local area of the problem space. W2(r), on the other hand, is preferably adjusted through supervised learning based on output error, or on reinforcement learning based on output hints. Connection weights W1(r) and W2(r) are further described below.

[0044] The neural network module 22 may further comprise a fuzzy output layer 52 having one or more fuzzy output nodes 54. Each fuzzy output node 54 represents a fuzzy quantisation of the output variables, similar to the fuzzy input nodes 46 of the fuzzy input layer 44. Preferably, a weighted sum input function and a saturated linear activation function are used for the nodes to calculate the membership degrees to which the output vector associated with the presented input vector belongs to each of the output membership functions.

[0045] The neural network module also includes output layer 56 having one or more output nodes 58. The output nodes 58 represent the real values of the output variables. Preferably a linear activation function is used to calculate the de-fuzzified values for the output variables.

[0046] The preferred form rule base layer 48 comprises one or more rule nodes 50 representing prototypes of input-output data associations that can be graphically represented as associations of hyper-spheres from the fuzzy input layer 44 spaces and the fuzzy output layer 52 spaces. Each rule node 50 has a minimum activation threshold which is preferably determined by a linear activation function.

[0047] As shown in FIG. 3, the neural network module 22 may also include a short-term memory layer 60 having one or more memory nodes 62. The purpose of the short-term memory layer 60 is to memorise structurally temporal relationships of the input data. The short-term memory layer is preferably arranged to receive information from and send information to the rule base layer 48.

[0048] As described above, each rule node 50 represents an association between a hyper-sphere from the fuzzy input space and a hyper-sphere from the fuzzy output space. These spheres are described with reference to FIG. 5, which illustrates example rule node 70 shown as r_j. Rule node r_j has an initial hyper-sphere 72 in the fuzzy input space. The rule node r_j has a sensitivity threshold parameter S_j which defines the minimum activation threshold of the rule node r_j to a new input vector x from a new example or input (x,y) in order for the example to be considered for association with this rule node. A new input vector x activates a rule node if x satisfies the minimum activation threshold, and is subsequently considered for association with the rule node. The radius of the input hyper-sphere 72 is defined as R_j = 1 − S_j, S_j being the sensitivity threshold parameter.

[0049] Rule node r_j has a matrix of connection weights W1(r_j) which represents the coordinates of the centre of the sphere 72 in the fuzzy input space. Rule node r_j also has a fuzzy output space hyper-sphere 74, the coordinates of the centre of the sphere 74 being connection weights W2(r_j). The radius of the output hyper-sphere 74 is defined as E, which represents the error threshold or error tolerance of the rule node 70. In this way it is possible for some rule nodes to be activated more strongly than other rule nodes by input data.

[0050] A new pair of data vectors (x,y) is transformed to fuzzy input/output data vectors (x_f, y_f) which will be allocated to the rule node 70 if x_f falls within input hyper-sphere 72 and y_f falls within the output hyper-sphere 74 when the input vector x is propagated through the input node. The distance of x_f from the centre of input hyper-sphere 72 and the distance of y_f from the centre of output hyper-sphere 74 provide a basis for calculating and assigning the magnitude or strength of activation. This strength of activation provides a basis for comparing the strengths of activation of different rule nodes. Therefore a further basis for allocation is where the rule node 70 receives the strongest activation among other rule nodes. The data vectors (x_f, y_f) will be associated with rule node 70 if the local normalised fuzzy difference between x_f and W1(r_j) is smaller than the radius R_j, and the normalised output error Err = ∥y−y′∥/Nout is smaller than an error threshold E, where Nout is the number of outputs and y′ is the output produced by the system. The E parameter sets the error tolerance of the system.

[0051] In the preferred method, a local normalised fuzzy difference (distance) between two fuzzy membership vectors d_1f and d_2f, which represent the membership degrees to which two real data vectors d₁ and d₂ belong to pre-defined membership functions (MFs), is calculated as:

$$D(d_{1f}, d_{2f}) = \lVert d_{1f} - d_{2f} \rVert / \lVert d_{1f} + d_{2f} \rVert \tag{1}$$

[0052] where ∥x−y∥ denotes the sum of all the absolute values of the vector obtained after vector subtraction (or summation in the case of ∥x+y∥) of two vectors x and y, and “/” denotes division. For example, if d_1f = (0, 0, 1, 0, 0, 0) and d_2f = (0, 1, 0, 0, 0, 0), then D(d_1f, d_2f) = (1+1)/2 = 1, which is the maximum value for the local normalised fuzzy difference.
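The worked example above can be reproduced with a few lines of Python. This is a minimal sketch of formula (1); the function name and the treatment of two all-zero vectors are assumptions, not part of the description.

```python
def fuzzy_distance(d1f, d2f):
    """Local normalised fuzzy difference of formula (1).

    Both arguments are membership vectors of equal length. The norm is
    the sum of absolute values of the element-wise difference (or sum).
    """
    if len(d1f) != len(d2f):
        raise ValueError("membership vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(d1f, d2f))
    denominator = sum(abs(a + b) for a, b in zip(d1f, d2f))
    if denominator == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return numerator / denominator


# The two membership vectors from the worked example.
example = fuzzy_distance([0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0])
```

Here `example` evaluates to 1.0, the maximum distance, because the two vectors have no membership in common.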

[0053] As new inputs are fed to rule node 70, those data inputs relevant to r_j may be associated with rule node 70, providing an opportunity for learning. As a new fuzzy input/output data vector (x_f, y_f) is fed to the rule node 70, the centre of the input hyper-sphere 72 is adjusted to a new sphere indicated at 72A by adjusting W1(r_j^(1)) to W1(r_j^(2)). The output hyper-sphere 74 is also adjusted to a new sphere as shown at 74A by adjusting W2(r_j^(1)) to W2(r_j^(2)).

[0054] The centres of the node hyper-spheres are adjusted in the fuzzy input space depending on the distance between the new input vector and the rule node through a learning rate l_j, a parameter that is individually adjusted for each rule node. The adjustment of the hyper-spheres in the fuzzy output space depends on the output error and also on the learning rate l_j through the Widrow-Hoff LMS algorithm, also called the Delta algorithm.

[0055] This adjustment in the input and in the output spaces can be represented mathematically by the change in the connection weights of the rule node r_j from W1(r_j^(1)) and W2(r_j^(1)) to W1(r_j^(2)) and W2(r_j^(2)) respectively, according to the following vector operations:

$$W1(r_j^{(2)}) = W1(r_j^{(1)}) + l_j \cdot (W1(r_j^{(1)}) - x_f) \tag{2}$$

$$W2(r_j^{(2)}) = W2(r_j^{(1)}) + l_j \cdot (A2 - y_f) \cdot A1(r_j^{(1)}) \tag{3}$$

[0056] where: A2 = f₂(W2.A1) is the activation vector of the fuzzy output neurons when the input vector x is presented; A1(r_j^(1)) = f₁(D(W1(r_j^(1)), x_f)) is the activation of the rule node r_j^(1); a simple linear function can be used for f₁ and f₂, e.g. A1(r_j^(1)) = 1 − D(W1(r_j^(1)), x_f), where D is the fuzzy normalised distance measure; and l_j is the current learning rate of the rule node r_j, calculated as l_j = 1/Nex(r_j), where Nex(r_j) is the number of examples currently associated with rule node r_j. The statistical rationale behind this is that the more examples that are currently associated with a rule node, the less it will “move” when a new example has to be accommodated by this rule node, i.e. the change in the rule node position is inversely proportional to the number of already associated examples, which is a statistical characteristic of the method.
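A single learning step for one rule node might be sketched in Python as follows. The sketch rests on several assumptions that should be read carefully: it considers one rule node in isolation (so A2 is computed from that node only), it uses a saturated linear function for f₂, and it moves the centres toward the new example, i.e. it applies the learning rate to (x_f − W1) and (y_f − A2) in the usual delta-rule direction. Equations (2) and (3) as printed write these differences in the opposite order, so the direction used here is an assumption of the sketch. All function and parameter names are hypothetical.

```python
def fuzzy_distance(d1f, d2f):
    """Local normalised fuzzy difference, formula (1)."""
    numerator = sum(abs(a - b) for a, b in zip(d1f, d2f))
    denominator = sum(abs(a + b) for a, b in zip(d1f, d2f))
    return numerator / denominator if denominator else 0.0


def update_rule_node(w1, w2, x_f, y_f, n_examples):
    """One illustrative learning step for a single rule node.

    w1, w2     -- current input and output centres (lists of floats)
    x_f, y_f   -- fuzzified input vector and desired fuzzy output vector
    n_examples -- Nex(r_j), the examples currently associated with the node
    """
    rate = 1.0 / n_examples                   # l_j = 1 / Nex(r_j)
    a1 = 1.0 - fuzzy_distance(w1, x_f)        # linear rule-node activation
    # Assumed saturated linear output activation for this node alone.
    a2 = [min(1.0, max(0.0, w * a1)) for w in w2]
    # Assumed direction: both centres move toward the new example.
    new_w1 = [w + rate * (x - w) for w, x in zip(w1, x_f)]
    new_w2 = [w + rate * (y - a) * a1 for w, y, a in zip(w2, y_f, a2)]
    return new_w1, new_w2


new_w1, new_w2 = update_rule_node(
    w1=[1.0, 0.0], w2=[0.2, 0.8], x_f=[0.5, 0.5], y_f=[0.0, 1.0], n_examples=2
)
```

Because the learning rate is 1/Nex(r_j), a node that already represents many examples shifts only slightly, which is the behaviour the paragraph above describes.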

[0057] When a new example is associated with a rule node r_j, not only does its location in the input space change, but also its receptive field, expressed as its radius R_j, and its sensitivity threshold S_j:

$$R_j^{(2)} = R_j^{(1)} + D(W1(r_j^{(2)}), W1(r_j^{(1)})), \quad R_j^{(2)} \le \mathit{Rmax} \tag{4}$$

and, respectively,

$$S_j^{(2)} = S_j^{(1)} - D(W1(r_j^{(2)}), W1(r_j^{(1)})) \tag{5}$$

[0058] where Rmax is a parameter set to restrict the maximum radius of the receptive field of a rule node.
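Equations (4) and (5) can be illustrated with the short Python sketch below. How the limit Rmax is enforced is not spelled out, so the sketch assumes the radius is simply capped at Rmax and that the sensitivity threshold is reduced by the growth actually applied, which keeps R_j = 1 − S_j. The function and parameter names are hypothetical.

```python
def update_receptive_field(radius, sensitivity, shift, r_max):
    """Grow the receptive field after the node centre has moved.

    shift -- D(W1 new, W1 old), the fuzzy distance the centre moved
    r_max -- Rmax, the largest radius a receptive field may reach
    """
    grown = radius + shift                    # equation (4) before the limit
    new_radius = min(grown, r_max)            # assumed: cap at Rmax
    applied = new_radius - radius             # growth actually applied
    new_sensitivity = sensitivity - applied   # equation (5)
    return new_radius, new_sensitivity


# A node created with the default sensitivity 0.9 has radius 0.1.
r_new, s_new = update_receptive_field(0.1, 0.9, shift=0.05, r_max=0.5)
```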

[0059] The adjustment and learning process in the fuzzy input space is illustrated in FIG. 6, which schematically illustrates how the centre r_j^(1) 82 of the rule node r_j 80 adjusts, after learning each new data point, to its new position r_j^(4) 84 based on one-pass learning on the four data points d₁, d₂, d₃ and d₄.

[0060] The adaptation component of the preferred system enables rule nodes to be inserted, extracted and adapted or aggregated, as will be described below. At any time or phase of the evolving or learning process, fuzzy or exact rules may be inserted by setting a new rule node r_j for each new rule, such that the connection weights W1(r_j) and W2(r_j) of the rule node represent this rule.

[0061] For example, the fuzzy rule (IF x₁ is Small and x₂ is Small THEN y is Small) may be inserted into the neural network module 22 by setting the connections of a new rule node to the fuzzy condition nodes x₁-Small and x₂-Small and to the fuzzy output node y-Small to a value of 1 each. The rest of the connections are set to a value of 0.

[0062] Similarly, an exact rule may be inserted into the module 22, for example IF x₁ is 3.4 and x₂ is 6.7 THEN y is 9.5. Here, the membership degrees to which the input values x₁=3.4 and x₂=6.7 and the output value y=9.5 belong to the corresponding fuzzy values are calculated and attached to the corresponding connection weights.
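The fuzzy-rule case can be sketched in Python as follows. The sketch assumes three fuzzy labels per variable and a flat weight layout in which the connections for each variable are stored consecutively in label order; neither the layout nor the names used are prescribed by the description.

```python
LABELS = ["Small", "Medium", "Large"]  # assumed label set


def insert_fuzzy_rule(conditions, conclusion, inputs, outputs, labels=LABELS):
    """Build the W1 and W2 weight vectors of a new rule node from a fuzzy rule.

    conditions -- mapping of input variable name to fuzzy label
    conclusion -- mapping of output variable name to fuzzy label
    Connections to the named fuzzy nodes are set to 1, all others to 0.
    """
    w1 = [1.0 if conditions.get(var) == label else 0.0
          for var in inputs for label in labels]
    w2 = [1.0 if conclusion.get(var) == label else 0.0
          for var in outputs for label in labels]
    return w1, w2


# IF x1 is Small and x2 is Small THEN y is Small
w1, w2 = insert_fuzzy_rule(
    {"x1": "Small", "x2": "Small"}, {"y": "Small"},
    inputs=["x1", "x2"], outputs=["y"],
)
```

For an exact rule, the same weight positions would instead hold the membership degrees obtained by fuzzifying the crisp values.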

[0063] The preferred adaptation component also permits rule extraction, in which new rules and relationships are identified by the system. Each rule node r_j can be expressed as a fuzzy rule, for example:

[0064] Rule r: IF x₁ is Small 0.85 and x₁ is Medium 0.15 and x₂ is Small 0.7 and x₂ is Medium 0.3 {radius of the receptive field of the rule r is 0.5}

[0065] THEN y is Small 0.2 and y is Large 0.8 {Nex(r) examples associated with this rule out of Nsum total examples learned by the system}.

[0066] The numbers attached to the fuzzy labels denote the degree to which the centres of the input and the output hyper-spheres belong to the respective membership functions.
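Rule extraction of this kind can be sketched in Python by reading the weight vectors of a rule node back into text. The flat weight layout, the label set and the single cut-off `threshold` below which a membership degree is omitted are all assumptions made for the illustration.

```python
def extract_rule(w1, w2, inputs, outputs, labels, threshold=0.1):
    """Express a rule node's W1 and W2 centres as an IF-THEN fuzzy rule."""

    def clauses(weights, variables):
        parts = []
        for v_idx, var in enumerate(variables):
            for l_idx, label in enumerate(labels):
                degree = weights[v_idx * len(labels) + l_idx]
                if degree >= threshold:       # drop negligible memberships
                    parts.append(f"{var} is {label} {degree:g}")
        return " and ".join(parts)

    return f"IF {clauses(w1, inputs)} THEN {clauses(w2, outputs)}"


rule = extract_rule(
    w1=[0.85, 0.15, 0.0, 0.7, 0.3, 0.0],
    w2=[0.2, 0.0, 0.8],
    inputs=["x1", "x2"], outputs=["y"],
    labels=["Small", "Medium", "Large"],
)
```

With these example weights, `rule` reproduces the condition and conclusion parts of the rule quoted above, without the annotations in braces.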

[0067] The adaptation component preferably also permits rule node aggregation. Through this technique, several rule nodes are merged into one, as is shown in FIGS. 7, 8 and 9 for an example of three rule nodes r₁, r₂ and r₃.

[0068] FIG. 7 illustrates a neural network module similar to the module of FIG. 3. The module may comprise, for example, an input layer 40, a fuzzy input layer 44, a rule base layer 48, a fuzzy output layer 52 and an output layer 56. The rule base layer 48 includes, for example, rule nodes r₁, r₂ and r₃ indicated at 90, 92 and 94 respectively.

[0069] For the aggregation of these three rule nodes r₁, r₂ and r₃, the following two aggregation strategies can be used to calculate the W1 connections of the new aggregated rule node r_agg (the same formulae are used to calculate the W2 connections):

[0070] as a geometrical centre of the three nodes:

$$W1(r_{agg}) = (W1(r_1) + W1(r_2) + W1(r_3)) / 3 \tag{6}$$

[0071] as a weighted statistical centre:

$$W1(r_{agg}) = (W1(r_1) \cdot \mathit{Nex}(r_1) + W1(r_2) \cdot \mathit{Nex}(r_2) + W1(r_3) \cdot \mathit{Nex}(r_3)) / \mathit{Nsum} \tag{7}$$

$$\mathit{Nex}(r_{agg}) = \mathit{Nsum} = \mathit{Nex}(r_1) + \mathit{Nex}(r_2) + \mathit{Nex}(r_3) \tag{8}$$

$$R_{r_{agg}} = D(W1(r_{agg}), W1(r_j)) + R_j \le \mathit{Rmax} \tag{9}$$

[0072] where r_j is the rule node from the three nodes that has a maximum distance from the new node r_agg and R_j is its radius of the receptive field. The three rule nodes will aggregate only if the radius of the aggregated node receptive field is less than a pre-defined maximum radius Rmax.
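Both aggregation strategies and the radius check can be sketched together in Python. The dictionary representation of a rule node, the function names and the choice to return `None` when the aggregated radius would exceed Rmax are assumptions made for this illustration.

```python
def fuzzy_distance(d1f, d2f):
    """Local normalised fuzzy difference, formula (1)."""
    numerator = sum(abs(a - b) for a, b in zip(d1f, d2f))
    denominator = sum(abs(a + b) for a, b in zip(d1f, d2f))
    return numerator / denominator if denominator else 0.0


def aggregate(nodes, r_max, weighted=True):
    """Merge rule nodes into one, or return None if the radius limit fails.

    Each node is a dict with keys 'w1', 'w2', 'nex' and 'radius'.
    weighted=False gives the geometrical centre, equation (6);
    weighted=True gives the weighted statistical centre, equation (7).
    """
    n_sum = sum(n["nex"] for n in nodes)                       # equation (8)

    def centre(key):
        dim = len(nodes[0][key])
        if weighted:
            return [sum(n[key][i] * n["nex"] for n in nodes) / n_sum
                    for i in range(dim)]
        return [sum(n[key][i] for n in nodes) / len(nodes) for i in range(dim)]

    w1, w2 = centre("w1"), centre("w2")
    # The node farthest from the new centre determines the new radius.
    farthest = max(nodes, key=lambda n: fuzzy_distance(w1, n["w1"]))
    radius = fuzzy_distance(w1, farthest["w1"]) + farthest["radius"]  # eq. (9)
    if radius > r_max:
        return None                     # assumed: aggregation is refused
    return {"w1": w1, "w2": w2, "nex": n_sum, "radius": radius}


nodes = [
    {"w1": [1.0, 0.0], "w2": [0.0, 1.0], "nex": 1, "radius": 0.1},
    {"w1": [0.8, 0.2], "w2": [0.0, 1.0], "nex": 1, "radius": 0.1},
    {"w1": [0.6, 0.4], "w2": [0.0, 1.0], "nex": 2, "radius": 0.1},
]
merged = aggregate(nodes, r_max=0.5)
```

In this example the weighted centre is pulled toward the third node, which represents two examples, whereas `aggregate(nodes, r_max=0.5, weighted=False)` places the centre at the plain average of the three nodes.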

[0073] FIG. 8 shows an example of aggregation as a geometrical centre of the three nodes, whereas FIG. 9 shows aggregation as a weighted statistical centre.

[0074] In order for a given node r_j to “choose” the other nodes with which it should aggregate, two subsets of nodes are formed: the subset of nodes r_k that, if activated to a degree of 1, will produce an output value y′(r_k) that differs from y′(r_j) by less than the error threshold E, and the subset of nodes that produce output values differing from y′(r_j) by more than the error threshold E. The W2 connections define these subsets. All the rule nodes from the first subset that are closer to r_j in the input space, in terms of W1 distance, than the node from the second subset closest to r_j are aggregated if the calculated radius of the new node r_agg is less than the pre-defined limit Rmax for a receptive field, as illustrated in FIG. 9.

[0075] Instead of aggregating all the rule nodes that are closer to a rule node r_j than the closest node from the other class, it is possible to keep the node in the aggregation pool that is closest to the other class out of the aggregation procedure, as a separate “guard” node, as shown in FIGS. 10, 11, 12 and 13, thus preventing future misclassification in the bordering area between the two classes.

[0076] The aggregation of spatially allocated rule nodes is described with reference to FIGS. 10 and 11. Referring to FIG. 10, two distinct sets of rule nodes have been selected and sorted for aggregation, shown generally as 100 and 102 respectively. Referring to FIG. 11, rule node 104 is classified as a guard and is not aggregated. The remaining rule nodes in set 100 are aggregated into new rule 106. Similarly, rule node 108 is not aggregated with the remaining aggregated rule nodes in set 102 shown at 110. In accordance with the invention, the sensitivity threshold and error threshold of rule nodes 104 and 108 are decreased to increase the activation threshold of these nodes, resulting in aggregated nodes 106 and 110 being activated in preference to guard nodes 104 and 108.

[0077] FIGS. 12 and 13 illustrate the same process of aggregation as that described in FIGS. 10 and 11, with the exception that the rule nodes are linearly allocated rather than spatially allocated as they are in FIGS. 10 and 11.

[0078] Aggregation in accordance with the invention is preferably performed after a certain number of examples are presented (parameter N_agg) over the whole set of rule nodes.

[0079] In a further preferred form, the system nodes r₁ that are not aggregated may decrease their sensitivity threshold S₁ and increase their radius R₁ by a small coefficient so that these nodes have more chance of winning the activation competition for the next input data examples and competing with the rest of the nodes.

[0080] Through node creation and consecutive aggregation, the preferred neural network module 22 may adjust over time to changes in the data stream and at the same time preserve its generalisation capabilities.

[0081] After a certain time (when a certain number of data examples have been presented to the system) some neurons and connections may be pruned. Different pruning rules can be applied for a successful pruning of unnecessary nodes and connections. One of them is given below:

[0082] IF (Age(r_j) > OLD) AND (the total activation TA(r_j) is less than a pruning parameter Pr times Age(r_j)) THEN prune rule node r_j,

[0083] where Age(r_j) is calculated as the number of examples that have been presented to the system after r_j had been first created; OLD is a pre-defined “age” limit; Pr is a pruning parameter in the range [0,1]; and the total activation TA(r_j) is calculated as the number of examples for which r_j has been the correct winning node (or among the m winning nodes in the m-of-n mode of operation).
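With crisp values for OLD and Pr, the pruning rule reduces to a one-line predicate. The Python sketch below is illustrative only; the function name and the parameter values in the example are assumptions.

```python
def should_prune(age, total_activation, old, pr):
    """Crisp form of the pruning rule.

    age              -- Age(r_j): examples presented since the node was created
    total_activation -- TA(r_j): examples for which the node was a correct winner
    old              -- the pre-defined age limit OLD
    pr               -- the pruning parameter Pr, in the range [0, 1]
    """
    return age > old and total_activation < pr * age


# A node that has seen 1000 examples but won only 50 of them is pruned
# when OLD = 500 and Pr = 0.1, because 50 < 0.1 * 1000.
prune_idle = should_prune(age=1000, total_activation=50, old=500, pr=0.1)
prune_busy = should_prune(age=1000, total_activation=200, old=500, pr=0.1)
```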

[0084] The above pruning rule requires that the fuzzy concepts of OLD, HIGH, etc. are defined in advance. As a partial case, a crisp value can be used, e.g. a node is OLD if it has existed during the evolving of a system for more than p examples. The pruning rule, and the way the values for the pruning parameters are defined, depend on the application task.

[0085] Parameters of each rule node may either be kept fixed during the entire operation of the system, or be adapted or optimised according to the incoming data. Adaptation may be achieved through the analysis of the behaviour of the system and through a feedback connection from the higher level modules. Genetic algorithms and evolutionary programming techniques can also be applied to optimise the structural and functional parameters of the neural network module 22.

[0086] In a further preferred form of the invention, a population of s systems is evolved simultaneously, each system having different parameter values. A certain “window” of incoming data is kept and updated for testing the fitness of each individually evolved system based on a mean square error fitness function. The best system is selected and “multiplied” through small deviations of the parameter values, thus creating the next generation of the population. The process continues without any time limit.

[0087] In terms of implementing the method and the system in a computer memory, when created, new rule nodes are either spatially or linearly allocated in the computer memory, and the actual allocation of nodes could follow one of a number of different strategies as is described below.

[0088] One such strategy, as shown in FIG. 14, could be a simple consecutive allocation strategy. Each newly created rule node is allocated in the computer memory next to the previous and to the following rule nodes, in a linear fashion, representing a time order.

[0089] Another possible strategy could be a pre-clustered location as shown in FIG. 15. For each output fuzzy node, there is a pre-defined location in the computer memory where the rule nodes supporting this pre-defined concept are located. At the centre of this area, the nodes that fully support this concept are placed. Every new rule node's location is defined based on the fuzzy output error and the similarity with other nodes. In a nearest activated node insertion strategy, a new rule node is placed nearest to the most highly activated node whose activation is still less than its sensitivity threshold. The side (left or right) where the new node is inserted is defined by the highest activation of the two neighbouring nodes.

[0090] A further strategy could include the pre-clustered location described above, further including temporal feedback connections between different parts of the computer memory loci, as shown in FIG. 16. New connections are set that link consecutively activated rule nodes through using the short-term memory and the links established through the W3 weight matrix. This will allow the neural network module 22 to repeat a sequence of data points starting from a certain point and not necessarily from the beginning.

[0091] A further strategy could include the additional feature that new connections are established between rule nodes from different neural network modules that become activated simultaneously, as shown in FIG. 17. This feature would enable the system to learn a correlation between conceptually different variables, for example the correlation between speech sound and lip movement.

[0092] An important feature of the adaptive learning system and method described above is that learning involves local element tuning. Only one rule node (or a small number, if the system operates in m-of-n mode) will be updated for each data example, or alternatively only one rule node will be created. This speeds up the learning procedure, particularly where linear activation functions are used in the neural network modules. A further advantage is that learning a new data example does not cause forgetting of old examples. Furthermore, new input and new output variables may be added during the learning process, thereby making the adaptive learning system more flexible to accommodate new information without disregarding already learned information.

[0093] The use of membership functions, membership degrees and normalised local fuzzy distance enables the system to deal with missing attribute values. In such cases, the membership degrees of all membership functions will be 0.5, indicating that the value, if it existed, may belong equally to them. Preference, in terms of which fuzzy membership functions the missing value may belong to, can also be represented through assigning appropriate membership degrees.

[0094] The preferred supervised learning algorithms of the invention enable the system to continually evolve and learn when a new input-output pair of data becomes available. This is known as an active mode of learning. In another mode, passive learning, learning is performed when there is no input pattern presented. Passive learning could be conducted after an initial learning. During passive learning, existing connections that store previously fed input patterns are used as an “echo” to reiterate the learning process. This type of learning could be applied in case of a short presentation time of the data, when only a small portion of the data is learned in one-pass on-line mode and then the training is refined through the echo learning method. The stored patterns in the W1 connection weights can be used as input vectors for the system refinement, with the W2 patterns indicating what the outputs will be.

[0095] Two preferred supervised learning algorithms are described below. The algorithms differ in the weight adjustment formulae.

[0096] The first learning algorithm is set out below:

[0097] Set initial values for the system parameters: number of membership functions; initial sensitivity thresholds (default S_(j)=0.9); error threshold E; aggregation parameter N_(agg) (the number of consecutive examples after which each aggregation is performed); pruning parameters OLD and Pr; a value for m (in m-of-n mode); maximum radius limit Rmax; thresholds T₁ and T₂ for rule extraction.
Set the first rule node r₀ to memorise the first example (x,y): W1(r₀)=x_(f), and W2(r₀)=y_(f);
Loop over presentations of new input-output pairs (x,y) {
  Evaluate the local normalised fuzzy distance D between x_(f) and the existing rule node connections W1 (formulae (1)).
  Calculate the activation A1 of the rule node layer.
  Find the closest rule node r_(k) (or the closest m rule nodes in case of m-of-n mode) to the fuzzy input vector x_(f) for which A1(r_(k)) >= S_(k) (sensitivity threshold for the node r_(k));
  if there is no such node, create a new rule node for (x_(f),y_(f))
  else
    Find the activation of the fuzzy output layer A2=W2.A1(1−D(W1,x_(f))) and the normalised output error Err = ||y − y′|| / Nout.
    if Err > E, create a new rule node to accommodate the current example (x_(f),y_(f))
    else update W1(r_(k)) and W2(r_(k)) according to (2) and (3) (in case of m-of-n system update all the m rule nodes with the highest A1 activation).
  Apply the aggregation procedure of rule nodes after each group of N_(agg) examples is presented.
  Update the values for the rule node r_(k) parameters S_(k), R_(k), Age(r_(k)), TA(r_(k)).
  Prune rule nodes if necessary, as defined by pruning parameters.
  Extract rules from the rule nodes.
}
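The core of the loop above (create a node, or update the winner) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: formulae (1) to (3) are not reproduced in this passage, so the distance and update rules below are assumed forms; the inputs are taken to be already fuzzified vectors; the error norm is assumed to be the L1 norm; and aggregation, pruning, parameter updates and rule extraction are omitted. The names `fuzzy_distance` and `EvolvingModule` are hypothetical.

```python
def fuzzy_distance(a, b):
    """Assumed form of formula (1): sum|a-b| / sum|a+b|, a value in [0, 1]."""
    num = sum(abs(p - q) for p, q in zip(a, b))
    den = sum(abs(p + q) for p, q in zip(a, b))
    return num / den if den else 0.0


class EvolvingModule:
    """Single-winner sketch of the first learning algorithm (no m-of-n mode)."""

    def __init__(self, sensitivity=0.9, err_threshold=0.1, lr=0.05):
        self.S = sensitivity      # sensitivity threshold, shared by all nodes here
        self.E = err_threshold    # error threshold E
        self.lr = lr              # learning rate (assumed equal for W1 and W2)
        self.W1 = []              # one fuzzy input exemplar per rule node
        self.W2 = []              # one fuzzy output exemplar per rule node

    def _add_node(self, xf, yf):
        self.W1.append(list(xf))
        self.W2.append(list(yf))

    def learn_one(self, xf, yf):
        if not self.W1:                       # first example is memorised as r0
            self._add_node(xf, yf)
            return
        acts = [1.0 - fuzzy_distance(xf, w) for w in self.W1]    # A1
        k = max(range(len(acts)), key=acts.__getitem__)          # closest rule node
        if acts[k] < self.S:                  # no node is sensitive enough
            self._add_node(xf, yf)
            return
        a2 = [acts[k] * w for w in self.W2[k]]                   # A2 from the winner
        err = sum(abs(y - a) for y, a in zip(yf, a2)) / len(yf)  # assumed L1 norm / Nout
        if err > self.E:                      # output error too large
            self._add_node(xf, yf)
            return
        # Assumed forms of formulae (2) and (3): move the winner towards the example.
        self.W1[k] = [w + self.lr * (x - w) for w, x in zip(self.W1[k], xf)]
        self.W2[k] = [w + self.lr * (y - a) * acts[k]
                      for w, y, a in zip(self.W2[k], yf, a2)]
```

Under these assumptions, an example that is close in the input space but has a different desired output creates a new rule node rather than overwriting the existing one.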

[0098] A modified version of the above algorithm is when the number of the winning rule nodes is chosen to be not 1, but m>1 (by default m=3). This mode is called “m-of-n”.

[0099] The second learning algorithm is different from the first learning algorithm in the weight adjustment formula for W2 as follows:

$\begin{matrix}{W2(r_j^{(2)}) = W2(r_j^{(2)}) + l_j\,(A2 - y_f)\,A1(r_j^{(2)})} & (11)\end{matrix}$

[0100] This means that after the first propagation of the input vector and error Err calculation, if the weights are going to be adjusted, W1 weights are adjusted first using equation (2) above and then the input vector x is propagated again through the already adjusted rule node r_(j) to its position r_(j)⁽²⁾ in the input space; a new error Err=(A2−y_(f)) is calculated, and after that the W2 weights of the rule node r_(j) are adjusted. This is a finer weight adjustment than the adjustment in the first algorithm that may make a difference in learning short sequences, but for learning longer sequences it may not manifest any difference in the results obtained through the simpler and faster first algorithm.

[0101] In addition to supervised learning, the system is also preferably arranged to perform unsupervised learning in which it is assumed that there are no desired output values available and the system evolves its rule nodes from the input space. Node allocation is based only on the sensitivity thresholds S_(j) and on the learning rates l_(j). If a new data item d activates a certain rule node (or nodes) above the level of its parameter S_(j), then this rule node (or the one with the highest activation) is adjusted to accommodate the new data item according to equation (2) above, or alternatively a new rule node is created. The unsupervised learning method of the invention is based on the steps described above as part of the supervised learning method when only the input vector x is available for the current input data item d.

[0102] Both the supervised and the unsupervised learning methods for the system are based on the same principles of building the W1 layer of connections. Either class of method could be applied on an evolving system so that if there are known output values, the system will use a supervised learning method, otherwise it will apply the unsupervised learning method on the same structure. For example, after having evolved a neural network module in an unsupervised way from spoken-word input data, the system may then use data labelled with the appropriate phoneme labels to continue the learning process of this system, now in a supervised mode.

[0103] The preferred system may also perform learning from output hints, or through reinforcement learning, in addition to the unsupervised or supervised learning. This is the case when the exact, desired output values do not become known for the purpose of adjusting the W2 connection weights. Instead, fuzzy hints F given in fuzzy linguistic labels that are used in the fuzzy output space may be given as feedback, e.g. “low output value is the desired one” while the output value produced by the system is “very low”. The system then calculates the fuzzy output error Errf=A2−F and then adjusts the connections W2 through formula (3).

[0104] The preferred system may also perform inference and have the ability to generalise on new input data. The inference method is part of the learning method when only the input vector x is propagated through the system. The system calculates the winner, or m winners, as follows: a winning rule node r for an input vector x is the node with: (i) the highest activation A1(r) among other rule nodes for which, (ii):

$\begin{matrix}{D(x, W1(r)) \leq R_r,} & (12)\end{matrix}$

[0105] where: D(x, W1(r)) is the fuzzy normalised distance between x and W1(r); Rr is the radius of the rule node r. If there is no rule node that satisfies the condition (ii) for the current input vector x, only condition (i) is used to select the winner.
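The winner selection described above can be illustrated with a short Python sketch. It assumes that the activations, distances and radii of all rule nodes have already been computed and that at least one rule node exists; the function name `select_winner` is hypothetical.

```python
def select_winner(activations, distances, radii):
    """Return the index of the winning rule node.

    Condition (ii): only nodes with D(x, W1(r)) <= R_r are eligible.
    Condition (i): among eligible nodes, the highest activation A1(r) wins.
    If no node satisfies (ii), condition (i) alone is applied to all nodes.
    """
    inside = [i for i, (d, r) in enumerate(zip(distances, radii)) if d <= r]
    candidates = inside if inside else range(len(activations))
    return max(candidates, key=lambda i: activations[i])
```

For m-of-n mode, the same candidate list could be sorted by activation and the first m entries taken.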

[0106] In a preferred form of the invention with reference to FIG. 3 above, a temporal layer 60 of temporal nodes 62 captures temporal dependencies between consecutive data examples. If the winning rule node at the moment (t−1), to which the input data vector at the moment (t−1) is associated, is r_(max)^((t−1)) and the winning node at the moment t is r_(max)^((t)), then a link between the two nodes is established as follows:

$\begin{matrix}{W3(r_{max}^{(t-1)}, r_{max}^{(t)}) = W3(r_{max}^{(t-1)}, r_{max}^{(t)}) + l_3\,A1(r_{max}^{(t-1)})\,A1(r_{max}^{(t)})} & (13)\end{matrix}$

[0107] where A1(r^((t))) denotes the activation of a rule node r at a time moment (t) and l₃ defines the degree to which the neural network module 22 associates links between rule nodes that include consecutive data examples. If l₃=0, no temporal associations are learned in the structure and the temporal layer 60 is effectively removed from the neural network module 22.
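As an illustration of equation (13), the temporal link update can be sketched in Python. The sparse dictionary used to hold W3, the function name and the default value of l₃ are assumptions made for this sketch only.

```python
from collections import defaultdict


def update_temporal_link(W3, prev_winner, winner, a1_prev, a1_now, l3=0.1):
    """Strengthen the link from the previous winner to the current winner.

    W3 maps (previous node index, current node index) to a link weight;
    missing links are treated as 0. With l3 == 0 nothing is learned.
    """
    W3[(prev_winner, winner)] += l3 * a1_prev * a1_now
    return W3


# Usage: node 0 won at time t-1 with activation 0.9, node 1 wins at t with 0.8.
W3 = defaultdict(float)
update_temporal_link(W3, 0, 1, 0.9, 0.8, l3=0.5)
```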

[0108] The learned temporal associations could be used to support the activation of rule nodes based on temporal pattern similarity. Here, temporal dependencies are learned through establishing structural links. These dependencies can be further investigated and enhanced through synaptic analysis, at the synaptic memory level, rather than through neuronal activation analysis at the behavioural level. The ratio of spatial similarity to temporal correlation can be balanced for different applications through two parameters S_(s) and T_(c), such that the activation of a rule node r for a new data example d=(x,y) is defined through the following vector operations:

$\begin{matrix}{A1(r) = \left| 1 - S_s\,D(W1(r), x_f) + T_c\,W3(r_{max}^{(t-1)}, r) \right|_{[0,1]}} & (14)\end{matrix}$

[0109] where |.|_([0,1]) is a bounded operation in the interval [0,1], and r_(max)^((t−1)) is the winning neuron at the previous time moment. Here temporal connections can be given a higher importance in order to tolerate a higher distance in time for time-dependent input vectors. If T_(c)=0, then temporal links are excluded from the functioning of the system.
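Equation (14) can be sketched for a single rule node as below. The bounded operation is interpreted here as clipping to [0,1], which is an assumption; the function name and default parameter values are hypothetical.

```python
def rule_activation(distance, temporal_link, Ss=1.0, Tc=0.0):
    """Activation A1(r) of one rule node per equation (14).

    distance      -- D(W1(r), x_f), the fuzzy distance to the input
    temporal_link -- W3(r_max^(t-1), r), link from the previous winner
    Ss, Tc        -- weights of spatial similarity and temporal correlation
    """
    raw = 1.0 - Ss * distance + Tc * temporal_link
    return min(1.0, max(0.0, raw))   # assumed meaning of the bounded operation
```

With Tc=0 the temporal term vanishes and the activation reduces to the purely spatial value 1 − Ss·D.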

[0110] The system is arranged to learn a complex chaotic function through on-line evolving from one-pass data propagation. The system is also arranged to learn time series that change their dynamics through time and never repeat the same patterns. Time series processes with changing dynamics could be of different origins, for example biological, environmental, industrial process control or financial. The system could also be used for off-line training and testing similar to other standard neural network techniques.

[0111] An example of learning a complex chaotic function is described with reference to FIGS. 18A and 18B. Here, the system is used with the Mackey-Glass chaotic time series data generated through the Mackey-Glass time delay differential equation: $\begin{matrix}{\frac{dx}{dt} = {\frac{a\,x(t - \tau)}{1 + x^{10}(t - \tau)} - b\,x(t).}} & (15)\end{matrix}$

[0112] This series behaves as a chaotic time series for some values of the parameters x(0) and τ. Here, x(0)=1.2, τ=17, a=0.2, b=0.1 and x(t)=0 for t<0. The input-output data for evolving the system from the Mackey-Glass time series data has an input vector [x(t), x(t−6), x(t−12), x(t−18)] and the output vector is [x(t+6)]. The task is to predict future values x(t+6) from four points spaced at six time intervals in the past.
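A data set of this kind could be generated along the following lines. This is a minimal sketch that integrates equation (15) with a forward Euler scheme and an assumed step of 0.1, which is a choice made here for brevity and is not stated for this example; the function names are hypothetical.

```python
def mackey_glass(n_points, tau=17, a=0.2, b=0.1, x0=1.2, dt=0.1):
    """Integrate equation (15) and return x(t) at integer t = 0 .. n_points-1."""
    steps_per_unit = round(1 / dt)
    delay = tau * steps_per_unit            # delay expressed in integration steps
    n_steps = (n_points - 1) * steps_per_unit
    x = [x0]
    for k in range(n_steps):
        x_tau = x[k - delay] if k >= delay else 0.0   # x(t) = 0 for t < 0
        dx = a * x_tau / (1.0 + x_tau ** 10) - b * x[k]
        x.append(x[k] + dt * dx)            # forward Euler step (assumed scheme)
    return x[::steps_per_unit]


def make_examples(series, lags=(0, 6, 12, 18), horizon=6):
    """Build ([x(t), x(t-6), x(t-12), x(t-18)], x(t+6)) pairs."""
    start, end = max(lags), len(series) - horizon
    return [([series[t - lag] for lag in lags], series[t + horizon])
            for t in range(start, end)]


series = mackey_glass(200)
examples = make_examples(series)
```

Each element of `examples` is one input-output pair that could be presented to the system in a single on-line pass.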

[0113] For the example, values for the system parameters are initially set as follows:

[0114] S=0.92, E=0.08, l=0.005, aggregation threshold is Rmax=0.15 and rule extraction thresholds T₁=T₂=0.1. Aggregation is performed after each consecutive group of N_(agg)=50 examples is presented.

[0115] Experimental results of the on-line evolving of the system are shown in FIGS. 18A and 18B. In particular, the figures show the desired versus predicted six-steps-ahead values obtained through one-pass on-line learning; the absolute error, the local on-line RMSE (LRMSE) and the local on-line NDEI (LNDEI) over time, as described below; the number of rule nodes created and aggregated over time; and a plot of the input data vectors, shown as circles, and the evolved rule nodes (the W1 connection weights), shown as crosses, projected in the two-dimensional input space of the first two input variables x(t) and x(t−6). It can be seen from FIGS. 18A and 18B that the number of the rule nodes is optimised after every 50 examples are presented. The rule nodes are located in the input and the output problem spaces so that they represent cluster centres of the input data that have similar output values subject to an error difference E.

[0116] The generalisation error of a neural network module on a next new input vector (or vectors) from the input stream, calculated through the evolving process, is called the local on-line generalisation error. The local on-line generalisation error at the moment t, for example, when the input vector is x(t) and the output vector calculated by the evolved module is y(t)′, is expressed as Err(t)=y(t)−y(t)′. The local on-line root mean square error LRMSE(t) and the local on-line non-dimensional error index LNDEI(t) can be calculated at each time moment t as:

$\begin{matrix}{\mathrm{LRMSE}(t) = \sqrt{\frac{\sum_{i=1}^{t} \mathrm{Err}(i)^2}{t}}; \quad \mathrm{LNDEI}(t) = \frac{\mathrm{LRMSE}(t)}{\mathrm{std}(y(1):y(t))}} & (16)\end{matrix}$

[0117] where std(y(1):y(t)) is the standard deviation of the output data points from 1 to t.
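The running error measures of equation (16) can be computed as sketched below for a scalar output. The text does not say whether the population or the sample standard deviation is meant; the population form is assumed here, and LNDEI is reported as not-a-number while the standard deviation is zero. The function name is hypothetical.

```python
import math
import statistics


def local_online_errors(desired, predicted):
    """Return the lists LRMSE(t) and LNDEI(t) for t = 1 .. n."""
    lrmse, lndei, sq_sum = [], [], 0.0
    for t, (y, y_hat) in enumerate(zip(desired, predicted), start=1):
        sq_sum += (y - y_hat) ** 2                   # accumulate Err(i)^2
        rmse_t = math.sqrt(sq_sum / t)
        std_t = statistics.pstdev(desired[:t])       # assumed population std
        lrmse.append(rmse_t)
        lndei.append(rmse_t / std_t if std_t > 0 else float("nan"))
    return lrmse, lndei


lrmse, lndei = local_online_errors([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
```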

[0118] For the chosen values of the parameters, there were 16 rule nodes evolved, each of them represented as one rule. Three of these rules are shown in FIG. 19, namely Rule 1, Rule 2 and Rule 16. These rules and the system inference mechanism define a system that is equivalent to the above equation (15) in terms of the chosen input and output variables subject to the calculated error.

[0119] As more input data is entered, after a certain time moment the LRMSE and LNDEI converge to constant values subject to a small error, in the example from FIG. 19: LRMSE=0.043, LNDEI=0.191. Generally speaking, in the case of a compact and bounded problem space the error can be made sufficiently small subject to appropriate selection of the parameter values for the system and the initial data stream. In the experiment above the chosen error tolerance was comparatively high, but the resulting system was compact. If the chosen error threshold E had been smaller (e.g. 0.05, or 0.02) more rule nodes would have been evolved and better prediction accuracy could have been achieved. Different neural network modules have different optimal parameter values, which depend on the task (e.g. time series prediction, classification).

[0120] A further example has been conducted in which the system has been used for off-line training and testing. The following parameter values are initially set before the system is evolved, namely MF=5, S=0.92, E=0.02, m=3, l=0.005. The system is evolved on the first 500 data examples from the same Mackey-Glass time series as in the example above for one pass of learning. FIG. 20 shows the desired versus the predicted on-line values of the time series. After the system is evolved, it is tested for a global generalisation on the second 500 examples. FIG. 21 shows the desired values versus the values predicted by the system in an off-line mode.

[0121] In a general case, the global generalisation root mean square error (RMSE) and the non-dimensional error index (NDEI) are evaluated on a set of p new examples from the problem space as follows:

$\begin{matrix}{\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{p} (y_i - y_i')^2}{p}}; \quad \mathrm{NDEI} = \frac{\mathrm{RMSE}}{\mathrm{std}(1:p)},} & (17)\end{matrix}$

[0122] where std(1:p) is the standard deviation of the data from 1 to p in the test set. In this example the evaluated RMSE is 0.01 and the NDEI is 0.046. After having evolved the system on a small but representative part of the whole problem space, its global generalisation error is sufficiently minimised.

[0123] The system is also tested for on-line test error on the test data while further training on it is performed. The on-line local test error is slightly smaller.

[0124] In one experimental application the preferred system can be used for life-long unsupervised learning from a continuous stream of new data. Such is the case of learning new sounds of new languages or new accents unheard before. One experiment is described with reference to FIGS. 22 and 23. The system is presented with the acoustic features of a spoken word “eight” having a phonemic representation of /silence/ /ei/ /t/ /silence/. In the experimental results shown in FIG. 22, three time lags of 26 mel scale coefficients taken from a window of 12 ms of the speech signal, with an overlap of 50%, are used to form 78-element input vectors. The input vectors are plotted over time as shown in FIG. 23.

[0125] Each new input vector from the spoken word is either associated with an existing rule node that is modified to accommodate this data, or a new rule node is created. The rule nodes are aggregated at regular intervals, which reduces the number of the nodes placed at the centres of the data clusters. After the whole word is presented, the aggregated rule nodes represent the centres of the anticipated phoneme clusters without the concept of phonemes being introduced to the system.

[0126] FIGS. 22 and 23 show clearly that three rule nodes were evolved after aggregation that represent the input data. For example, frames 0 to 53 indicated at 120 and frames 96 to 170 indicated at 122 are allocated to rule node 1 which represents the phoneme /silence/. Frames 56 to 78 indicated at 124 are allocated to rule node 2 which represents the phoneme /ei/. Frames 85 to 91 indicated at 126 are allocated to rule node 3 which represents the phoneme /t/. The remaining frames represent transitional states. For example, frames 54 to 55 represent the transition between /silence/ and /ei/. Frames 79 to 84 represent the transition between /ei/ and /t/. Frames 92 to 96 represent the transition between /t/ and /silence/. These frames are allocated to some of the closest rule nodes in the input space. If a higher sensitivity threshold had been used, this would have resulted in additional rule nodes evolved to represent these short transitional sounds.

[0127] When further pronunciations of the word “eight” or other words are presented to the unsupervised system, the system refines the phoneme regions and the phoneme rule nodes or creates new phoneme rule nodes. The unsupervised learning method described above permits experimenting with different strategies of learning, namely increased sensitivity over time, decreased sensitivity over time and using forgetting in the process of learning. It also permits experimenting with several languages in a multilingual system.

[0128] In an experimental setting a system is evolved on both spoken words from New Zealand English and spoken words from Maori. Some of the evolved phoneme rule nodes are shared between the acoustical representations of the languages, as illustrated in FIG. 24, where the evolved rule nodes as well as a trajectory of the spoken word ‘zoo’ are plotted in the two-dimensional space of the first two principal components of the input acoustic space. The rule nodes in the evolved system form a compact representation of the acoustic space of the two languages presented to the system. The system can continuously be trained on further words of the two languages or more languages, thus refining the acoustic space representation with the use of the sharing sounds (phonemes) principle.

[0129] The system has been subject to an experiment concerned with the task of on-line time series prediction of the Mackey-Glass data. Here the standard CMU benchmark format of the time series is used. The data is generated with τ=17 using a second order Runge-Kutta method with a step size of 0.1, with four inputs, namely x(t), x(t−6), x(t−12) and x(t−18), and one output, namely x(t+85). Training data is from t=200 to t=3200 while test data is from t=5000 to t=5500. All 3000 training data sets were used to evolve two types of neural network modules.

[0130] For the purposes of the first and second learning algorithms described above, the following initial values of the parameters were chosen: MF=3, S=0.7, E=0.02, m=3, l=0.02, Rmax=0.2, N_(agg)=100. The number of the centres and the local on-line LNDEI are calculated and compared with the results for the RAN model, as described in Platt, J “A resource allocating network for function interpolation”, Neural Computation 3, 213-225 (1991), and its modifications.

[0131] The results are shown in FIG. 25. The two modifications of the system result in a smaller on-line error than the other methods and in a reasonable number of rule nodes. The two learning algorithms are shown as System-su and System-dp.

[0132] As the system preferably uses linear equations for calculating the activation of the rule nodes, rather than Gaussian functions and exponential functions as in the RAN model, the present system learning procedure is faster than the learning procedure in the RAN model and its modifications. The system also produces better on-line generalisation, which is a result of more accurate node allocation during the learning process. This is in addition to the advantageous knowledge representation features of the preferred system that include clustering of the input space, and rule extraction and rule insertion.

[0133] The system has also been subject to a further experiment dealing with a classification task on case study data of spoken digits. The task is recognition of speaker independent pronunciations of English digits from the Otago corpus database (http://kel.otago.ac.nz/hyspeech/corpus/). Seventeen speakers (12 males and 5 females) are used for training and a further 17 speakers (12 males and 5 females) are used for off-line testing. Each speaker utters 30 instances of English digits during a recording session in a quiet room, resulting in clean data, for a total of 510 training and 510 testing utterances. Eight mel frequency scale cepstrum coefficients (MFSCC) and log-energy are used as acoustic features. In order to assess the performance of the system in this application, a comparison with Learning Vector Quantisation (LVQ) is made. Clean training speech is used to train both LVQ and the present system. Office noise is introduced to the test speech data to evaluate the behaviour of the recognition systems in a noisy environment, with a signal-to-noise ratio of 10 dB.

[0134] The classification off-line test accuracy for the LVQ model and the present system, and also the local on-line test accuracy for the system, are evaluated and shown in FIG. 26.

[0135] The LVQ model has the following parameter values: 396 code-book vectors and 15840 training iterations. The present system has the following parameter values: one training iteration, 3 MFs, 157 rule nodes, initial values S=0.9, E=0.1, l=0.01. Maximum radius is Rmax=0.2 and the number of examples for aggregation is N_(agg)=100.

[0136] The results show that the present system with off-line learning and testing on new data performs much better than the LVQ method, as shown in FIG. 26. As the present system allows for continuous training on new data, further testing and also training of the system on the test data in an on-line mode leads to a significant improvement of accuracy.

[0137] The system has also been subject to a further experiment dealing with a classification task on bio-informatics case study data obtained from the machine learning database repository at the University of California at Irvine. It contains primate splice-junction gene sequences for the identification of splice site boundaries within these sequences. In eukaryotes the genes that code for proteins are split into coding regions (exons) and noncoding regions (introns) of the DNA sequence at defined boundaries, the so-called splice sites. The data set consists of 3190 DNA sequences which are 60 nucleotides long and classified as an exon-intron boundary (EI), an intron-exon boundary (IE) or a non-splice site (N). The system uses 2 MF and a four bit encoding scheme for the bases.

[0138] After training the system on existing data the system is able to identify potential splice sites within new sequences. Using a sliding window of 60 bases to cover the entire sequence being examined, the boundaries are identified as EI, IE, or N. A score is given to each boundary identified that represents the likelihood that the identified boundary has been identified correctly. The system can be continuously trained on new known data sequences, thus improving its performance on unknown data sequences. At any time of the functioning of the system knowledge can be extracted from it in the form of semantically meaningful rules that describe important biological relationships. Some of the extracted rules with a rule extraction threshold T1=T2=0.7 are further simplified, formatted and presented in a way that can be interpreted by the user, as shown in FIG. 27. Using different rule extraction thresholds would allow extraction of different sets of rules that have different levels of abstraction, thus allowing for a better understanding of the gene sequences.

[0139] The system has also been subject to a further experiment dealing with a classification task on bio-informatics case study data, namely a data set of 72 classification examples for leukemia. The data set consists of two classes and a large input space: the expression values of 7,129 genes monitored by Affymetrix arrays (Golub et al.). The two types of leukemia are acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).

[0140] The task is twofold: 1) Finding a set of genes distinguishing ALL and AML, and 2) Constructing a classifier based on the expression of these genes allowing for new data to be entered to the system once they have been made available. The system accommodates or adapts to this data, improving the classification results. The system is evolved through one pass training on each consecutive example and testing it on the next one.

[0141] During the process of on-line evolving the system learns each example and then attempts to predict the class of the next one. Here the system continually evolves with new examples accommodated, as they become available. At any time of the system operation rules that explain which genes are more closely related to each of the classes can be extracted. FIG. 28 shows two of the extracted rules after the initial 72 examples are learned by the system. The rules are “local” and each of them has the meaning of the dominating rule in a particular cluster of the input space.

[0142] The system in an on-line learning mode could be used as a building block for creating adaptive speech recognition systems that are based on an evolving connectionist framework. Such systems would be able to adapt to new speakers and new accents, and add new words to their dictionaries at any time of their operation.

[0143] Possible applications of the invention include adaptive speech recognition in a noisy environment, adaptive spoken language evolving systems, adaptive process control, adaptive robot control, adaptive knowledge based systems for learning genetic information, adaptive agents on the Internet, adaptive systems for on-line decision making on financial and economic data, adaptive automatic vehicle driving systems that learn to navigate in a new environment (cars, helicopters, etc), and classifying bio-informatic data.

[0144] The foregoing describes the invention including preferred forms thereof. Alterations and modifications as will be obvious to those skilled in the art are intended to be incorporated within the scope hereof, as defined by the accompanying claims.

1. A neural network module comprising: an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes; an output layer comprising one or more output nodes; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.
2. A neural network module as claimed in claim 1 wherein each rule node in the rule base layer has a minimum activation threshold, each rule node arranged to be activated where input data satisfies the minimum activation threshold of the rule node.
3. A neural network module as claimed in claim 2 wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on the input data.
4. A neural network module as claimed in claim 2 or claim 3 wherein each rule node is assigned a magnitude of activation when activated by input data.
5. A neural network module as claimed in claim 4 wherein the adaptive component is arranged to aggregate two or more rule nodes based on the magnitude of activation when activated by input data.
6. A neural network module as claimed in any one of claims 2 to 5 wherein the adaptive component is arranged to increase the minimum activation threshold of one or more rule nodes not selected for aggregation.
7. A neural network module as claimed in any one of claims 2 to 6 wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on both the input data and desired output data.
8. A neural network module as claimed in any one of the preceding claims wherein the adaptive component is arranged to insert new rule nodes into the rule base layer.
9. A neural network module as claimed in any one of the preceding claims wherein the adaptive component is arranged to extract rules from the rule base layer.
10. A neural network module as claimed in any one of claims 3 to 9 wherein the parameters of the activation threshold of each rule node are adjusted based at least partially on new input data.
11. A neural network module as claimed in any one of claims 3 to 10 further comprising a memory in which is stored input data, wherein the parameters of the activation threshold of each rule node are adjusted based at least partially on the stored input data.
12. A neural network module as claimed in any one of the preceding claims further comprising a fuzzy input layer comprising one or more fuzzy input nodes arranged to transform input node values for use by the rule base layer.
13. A neural network module as claimed in any one of the preceding claims further comprising a fuzzy output layer comprising one or more fuzzy output nodes arranged to transform data output from the rule base layer.
14. An adaptive learning system comprising one or more neural network modules as claimed in any one of the preceding claims.
15. A method of implementing a neural network module comprising the steps of: arranging an input layer comprising one or more input nodes to receive input data; arranging a rule base layer comprising one or more rule nodes; arranging an output layer comprising one or more output nodes; and arranging an adaptive component to aggregate selected two or more rule nodes in the rule base layer based on the input data.
16. A method of implementing a neural network module as claimed in claim 15 further comprising the steps of assigning a minimum activation threshold to each rule node in the rule base layer; and arranging each rule node to be activated where input data satisfies the minimum activation threshold of the rule node.
17. A method of implementing a neural network module as claimed in claim 16 further comprising the step of adjusting the parameters of the activation threshold of each rule node activated by input data based on the input data.
18. A method of implementing a neural network module as claimed in claim 16 or claim 17 further comprising the step of assigning to each rule node a magnitude of activation when activated by input data.
19. A method of implementing a neural network module as claimed in claim 18 further comprising the step of arranging the adaptive component to aggregate two or more rule nodes based on the magnitude of activation when activated by input data.
20. A method of implementing a neural network module as claimed in any one of claims 16 to 19 further comprising the step of arranging the adaptive component to increase the minimum activation threshold of one or more rule nodes not selected for aggregation.
21. A method of implementing a neural network module as claimed in any one of claims 16 to 20 further comprising the step of adjusting the parameters of the activation threshold of each rule node activated by input data based on both the input data and desired output data.
22. A method of implementing a neural network module as claimed in any one of claims 15 to 21 further comprising the step of arranging the adaptive component to insert new rule nodes into the rule base layer.
23. A method of implementing a neural network module as claimed in any one of claims 15 to 22 further comprising the step of arranging the adaptive component to extract rules from the rule base layer.
24. A method of implementing a neural network module as claimed in any one of claims 17 to 23 further comprising the step of adjusting the parameters of the activation threshold of each rule node based at least partially on new input data.
25. A method of implementing a neural network module as claimed in any one of claims 17 to 24 further comprising the steps of maintaining in a memory input data; and adjusting the parameters of activation threshold of each rule node based at least partially on the stored input data.
26. A method of implementing a neural network module as claimed in any one of claims 15 to 25 further comprising the step of arranging a fuzzy input layer comprising one or more fuzzy input nodes to transform input node values for use by the rule base layer.
27. A method of implementing a neural network module as claimed in any one of claims 15 to 26 further comprising the steps of arranging a fuzzy output layer comprising one or more fuzzy output nodes to transform data output from the rule base layer.
 28. A neural network computer program comprising: an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes; an output layer comprising one or more output nodes; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.
 29. A neural network computer program as claimed in claim 28 wherein each rule node in the rule base layer has a minimum activation threshold, each rule node arranged to be activated where input data satisfies the minimum activation threshold of the rule node.
 30. A neural network computer program as claimed in claim 29 wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on the input data.
 31. A neural network module as claimed in claim 29 or claim 30 wherein each rule node is assigned a magnitude of activation when activated by input data.
 32. A neural network computer program as claimed in claim 31 wherein the adaptive component is arranged to aggregate two or more rule nodes based on the magnitude of activation when activated by input data.
 33. A neural network computer program as claimed in any one of claims 29 to 32 wherein the adaptive component is arranged to increase the minimum activation threshold of one or more rule nodes not selected for aggregation.
 34. A neural network computer program as claimed in any one of claims 29 to 33 wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on both the input data and desired output data.
 35. A neural network computer program as claimed in any one of claims 28 to 34 wherein the adaptive component is arranged to insert new rule nodes into the rule base layer.
 36. A neural network computer program as claimed in any one of claims 28 to 35 wherein the adaptive component is arranged to extract rules from the rule base layer.
 37. A neural network computer program as claimed in any one of claims 30 to 36 wherein the parameters of activation threshold of each rule node are adjusted based at least partially on new input data.
 38. A neural network computer program as claimed in any one of claims 30 to 37 further comprising input data stored in a memory, the parameters of the activation threshold of each rule node arranged to be adjusted based at least partially on the stored input data.
 39. A neural network computer program as claimed in any one of claims 28 to 38 further comprising a fuzzy input layer comprising one or more fuzzy input nodes arranged to transform input node values for use by the rule base layer.
 40. A neural network computer program as claimed in any one of claims 28 to 39 further comprising a fuzzy output layer comprising one or more fuzzy output nodes arranged to transform data output from the rule base layer.
 41. An adaptive learning computer program comprising one or more neural network programs as claimed in any one of claims 28 to 40.
 42. A neural network computer program as claimed in any one of claims 28 to 40 embodied on a computer-readable medium.
 43. An adaptive learning computer program as claimed in claim 41 embodied on a computer-readable medium.
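
By way of illustration only, and without limiting the claims, the aggregation of rule nodes recited in claims 28, 29, 31 and 32 may be sketched as follows. The sketch is not taken from the specification. The names RuleNode, centre, activation and aggregate are hypothetical, the activation measure (one minus the mean absolute difference, for inputs scaled to the range 0 to 1) is an assumption, and merging by averaging the node centres is merely one possible way of aggregating two or more activated rule nodes.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RuleNode:
    """Hypothetical rule node; field names are illustrative assumptions."""
    centre: List[float]   # assumed: point in input space the node responds to
    threshold: float      # minimum activation threshold of the node

    def activation(self, x: Sequence[float]) -> float:
        # Assumed magnitude of activation: 1 minus the mean absolute
        # difference between the input and the node centre.
        dist = sum(abs(c - v) for c, v in zip(self.centre, x)) / len(x)
        return 1.0 - dist


def aggregate(nodes: List[RuleNode], x: Sequence[float]) -> List[RuleNode]:
    """Merge the rule nodes activated by x into a single node.

    Nodes whose activation falls below their threshold are kept unchanged.
    If fewer than two nodes are activated, nothing is merged.
    """
    active, rest = [], []
    for node in nodes:
        (active if node.activation(x) >= node.threshold else rest).append(node)
    if len(active) < 2:
        return list(nodes)
    # Assumed merge rule: average the centres, keep the strictest threshold.
    centre = [sum(n.centre[i] for n in active) / len(active)
              for i in range(len(x))]
    merged = RuleNode(centre, max(n.threshold for n in active))
    return rest + [merged]


if __name__ == "__main__":
    layer = [
        RuleNode([0.1, 0.1], 0.8),
        RuleNode([0.2, 0.2], 0.8),
        RuleNode([0.9, 0.9], 0.8),
    ]
    for node in aggregate(layer, [0.15, 0.15]):
        print(node)
```

In this sketch the first two nodes are both activated by the input and are replaced by one node, while the third node is not activated and is left in the rule base layer. Raising the threshold of nodes not selected for aggregation, as in claim 33, is not shown.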